Statistical deconvoluting of mixtures

ABSTRACT

Statistical classification of activities of molecules is a computer-implemented methodology of QSAR employing visualization of molecular features and statistical techniques for correlating features of molecules with their observed biological properties. Each molecule is described by noting the presence (1) or absence (0) of a feature of interest. The identification of specific features coded by 1's or 0's is accomplished by recursive partitioning. The data sets are planned or unplanned. The method is also applicable to classification of individuals in biological populations on the basis of their genetic makeup.

BACKGROUND OF THE INVENTION

[0001] A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.

[0002] This invention relates generally to computer assisted methods of analyzing chemical or biological activity and specifically to computer assisted methods of determining chemical structure-activity relationships, and determining which species in a mixture from a chemical or biological population can be predicted to have a given biological activity or biological phenotype. This method is particularly useful in the fields of chemistry and genetics.

[0003] Combinatorial chemistry and high-throughput screening (HTS) are having a major impact on the way pharmaceutical companies identify new therapeutic lead chemical compounds. Voluminous quantities of data are now being produced routinely from the synthesis and testing of thousands of compounds in a high-throughput biochemical assay. The construction of chemical libraries has, in effect, replaced the painstaking individual synthesis of compounds for biological testing with a strategy for the multiple synthesis of many compounds about a common structural core scaffold. Since there is such a low probability of identifying new lead compounds from screening programs, it is expected that the sheer number of compounds made via a combinatorial approach will provide many more opportunities to find novel leads. However, making and testing thousands of compounds instead of fifty to one hundred per chemist per year has placed a tremendous strain on the logistical and computational infrastructure usually relied upon to store and analyze these datasets. Methods, developed in the last decade, for the statistical analysis of a relatively small number of compounds (less than 100) are not suitable for use on much larger data sets. Consequently, new technologies must be investigated.

[0004] Various methods for the storage and retrieval of chemical structure/biological activity data have been devised. Software products are now available from major vendors that address most of the logistical needs of combinatorial chemistry. Little thought, however, has been given to how the data might best be used to guide future synthetic efforts once the biological activity of chemical compounds has been learned. One possible result from the synthesis and testing of large numbers of compounds is a short list of promising new lead compounds for further consideration. Many research programs stop here and immediately revert to traditional synthesis in order to optimize the new leads. On the other hand, others seeking to continue along a combinatorial path have employed an evolutionary approach to make best use of all the data.

[0005] Genetic algorithms have also been used to select new chemical libraries to be made. However, due to the complex and specialized nature of the software used to identify 3D pharmacophores, it is unlikely that these methods will be able to routinely handle the volume of data and/or possible multiple binding modes or sites.

[0006] For a number of years, there has been an interest in using artificial intelligence methods to deconvolute, uncover hidden rules from, or otherwise classify chemical datasets. Most have focused on reaction prediction. Others have used neural networks, fuzzy adaptive least squares and the like to analyze structure-activity datasets or predict chemical properties. Most of these methods are generally much too complex for routine structure-activity-relationship (SAR) analysis of large heterogeneous data sets.

[0007] Recursive partitioning (RP) is a simple, yet powerful, statistical method that seeks to uncover relationships in large data sets. These relationships may involve thresholds, interactions and nonlinearities. Any or all of these factors impede an analysis that is based on assumptions of linearity such as multiple linear regression (or basic QSAR), principal component regression (PCR), or partial least squares (PLS). Various implementations of RP exist but none have been adapted to the specific problem of generating SAR. The present invention features a new computer program, Statistical Classification of Molecules using recursive partitioning (SCAM), to analyze large numbers of binary descriptors (which are concerned only with the presence or absence of a particular feature) and to interactively partition a data set into active classes.

SUMMARY OF THE INVENTION

[0008] In brief summary, the invention is a computer-based method of encoding features of mixtures, whether the features be of individual data objects in a mixture or features of mixtures themselves, and of identifying and correlating those individual features to a response characteristic that is a trait of interest of the individual data object or of the mixture. The method is applicable to data objects in those types of data sets that are characterized in being a mixture of data object classes, each data object class containing one or more of the data objects, and wherein multiple data objects present a same trait of interest, but classes of data objects produce the response characteristic that is a trait of interest through different underlying mechanisms. The method comprises the steps of: assembling a set of descriptors and converting said set of descriptors into the form of a bit string such that each descriptor reflects the presence or absence of a potentially useful feature in a data object of interest; examining each data object for presence or absence of each of said descriptors; assembling the results of looking for descriptors into a vector for each data object, noting the presence or absence of each feature in said data object; assembling all vectors thus generated into a matrix; dividing the data in said matrix into two daughter sets on the basis of presence or absence of a given descriptor from said set of descriptors; and iteratively repeating this step until each member of said mixture has been classified into a group. The method is applicable to three broad situations. Firstly, those situations in which data objects are unique, but the data set is a mixture in the sense that the data objects act in different ways, e.g. a population of human patients having different biological genotypes that nonetheless lead to a phenotypically identical clinical disease diagnosis. Secondly, those situations in which the data objects are themselves mixtures, e.g. a mixture of k chemical compounds tested together in a high throughput screen, or a mixture of different structural modes of a compound, and those data objects that show a given activity of interest do so in the same fashion or through the same underlying mechanism of action. And thirdly, those situations in which the data objects are mixtures and the active elements in the mixtures produce the same activity, but are acting through different mechanisms, for example, where k chemical compounds are screened together for activity and two of the compounds bind to a biological receptor, but bind to it in different places or in different conformations. Each of these three types of situations can be addressed whether they are planned or inadvertent mixtures. A planned mixture occurs where the fact of being a mixture is capable of manual control as is the case with carrying out a combinatorial synthesis, or where a high throughput screening is carried out with, for example, 20 compounds tested together. An inadvertent mixture is said to be present whenever it is inherent in the situation, for example where there are multiple structural conformations of a chemical compound, or where a data set contains compounds producing the same chemical result but acting by different mechanisms, or where a data set contains compounds producing the same biochemical result, but binding to different receptor sites or places, or where the data set is a human population having the same clinical disease, but the individuals have different genetic types coding for different underlying pathologies.

BRIEF DESCRIPTION OF FIGURES

[0009] FIG. 1 is a schematic illustration of the process to identify important features of individual compounds in a mixture.

[0010] FIG. 2 is a schematic illustration of the process to identify important features of a mixture and identify active components.

[0011] FIG. 3 is a schematic illustration of the process to identify active component(s) of a mixture and the features associated with biological activity of chemical structures.

[0012] FIG. 4 is an illustration of a matrix having multiple vectors representing compounds.

[0013] FIG. 5 is an illustration of an analysis tree (also known as a Pachinko tree) generated using recursive partitioning as part of the invention in order to classify structural features of a group of chemical compounds.

[0014] FIG. 6 is an illustration of an analysis tree generated using recursive partitioning as part of the invention in order to classify genetic features of a population.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS AND BEST MODE OF THE INVENTION

[0015] The method of the present invention overcomes previous shortcomings in the chemical and biological arts. In a first preferred embodiment, structure-activity relationships (SAR's) can be developed from large bodies of data generated as a result of high throughput screening (HTS), or combinatorial or other automated chemical syntheses. Such chemical syntheses output data sets composed of large numbers of structurally heterogeneous chemical compounds.

[0016] First, a set of descriptors is generated. Descriptors, as that term is used in the present invention, are any type of descriptive notation that, in the context of chemistry, are chemically interpretable, have enough detail that they can capture useful chemical structural features, and are capable of being described in terms of being present or absent in a given chemical compound, which in turn confers the ability to describe them computationally as a bit string. A partial, non-limiting list of descriptors can include: atom pairs, which set forth a spatial-qualitative relationship between any two atoms in a molecule; atom triples, which set forth a spatial-qualitative relationship between any three atoms in a molecule; descriptions of molecular fragments; descriptions of molecular topological torsions; any binary or continuous variables; or any combination of any of these types of descriptors. In the context of biology, a descriptor can be, most preferably, a genetic marker, such that an individual subject in a population of interest either does or doesn't have the marker or a particular allele of a gene.

[0017] For any of the above-listed descriptors, or any non-listed descriptors that otherwise fit the above stated criteria, it can readily be seen that for any single chemical compound under consideration, it can be stated that the compound either has or doesn't have the descriptor. This presence or absence of such a descriptor for a compound can be represented computationally as a bit string, by a series of 1's or 0's, each representing presence or absence, respectively, of a given descriptor for the compound under consideration. Multiple descriptors of a given type are generated, and each chemical compound is compared against each descriptor for the presence or absence of each descriptor in the specified set of descriptors that can occur in a data set. This comparison process yields a bit string of 1's and 0's, as the case may be, that constitutes a vector. The vector's sequence of 1's and 0's will be an identifier of the compound under consideration, defining it in terms of the set of descriptors that occur in the data set.
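The encoding step described above can be sketched as follows. This is a minimal illustration only, not the SCAM implementation; the function name `encode` and the descriptor labels are hypothetical placeholders.

```python
def encode(features_present, descriptor_set):
    """Return a 0/1 vector with one position per descriptor.

    A position holds 1 when the compound has that descriptor, else 0.
    """
    return [1 if d in features_present else 0 for d in descriptor_set]


# Hypothetical descriptor labels, chosen only for illustration.
descriptors = ["pair_a", "pair_b", "pair_c", "pair_d"]
compound = {"pair_a", "pair_c"}  # descriptors found in one compound

vector = encode(compound, descriptors)
print(vector)
```

Stacking one such vector per compound gives the matrix of 1's and 0's that the later partitioning steps operate on.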

[0018] Two types of descriptors can be exemplified. Atom pairs and atom triples are descriptors generated from the topological (2D) representation of a molecular structure. They are very simple descriptors composed of atoms separated by the minimal topological distance (i.e., the number of bonds) between them, or equivalently, the number of atoms in the shortest path connecting the atoms. Each local atomic environment is characterized by three values: the atomic number, the number of non-hydrogen connections and one-half of all associated π-electrons. For example, the carbonyl carbon in acetone is encoded as [C, 3, 1] whilst a terminal methyl carbon would be [C, 1, 0]. The code for the carbonyl oxygen is [O, 1, 1]. Thus, for each structure, (n(n−1))/2 atom pairs (where n is the number of non-hydrogen atoms in a structure) are generated by considering each atom and the minimal topological distance to every other atom in turn. A bit-string indicating the presence or absence of a particular atom pair is then produced. In general, approximately ten thousand unique types of atom pairs are generated for a typical data set of about one thousand structures.
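As a worked illustration of the atom-pair count, the sketch below enumerates the pairs for acetone using the atom codes quoted above and bond counts as distances. The graph representation, the function names and the choice to canonicalize a pair by sorting its two atom codes are assumptions made for this example, not details taken from the disclosure.

```python
from collections import deque
from itertools import combinations

# Acetone heavy atoms, coded as in the text: (element, connections, pi share).
atoms = {0: ("C", 1, 0), 1: ("C", 3, 1), 2: ("O", 1, 1), 3: ("C", 1, 0)}
bonds = [(0, 1), (1, 2), (1, 3)]


def bond_distances(atoms, bonds, start):
    """Breadth-first search giving the number of bonds from start to each atom."""
    neighbours = {a: [] for a in atoms}
    for a, b in bonds:
        neighbours[a].append(b)
        neighbours[b].append(a)
    dist = {start: 0}
    queue = deque([start])
    while queue:
        cur = queue.popleft()
        for nxt in neighbours[cur]:
            if nxt not in dist:
                dist[nxt] = dist[cur] + 1
                queue.append(nxt)
    return dist


def atom_pairs(atoms, bonds):
    """List every atom pair as (code, code, distance), codes in sorted order."""
    pairs = []
    for a, b in combinations(sorted(atoms), 2):
        d = bond_distances(atoms, bonds, a)[b]
        first, second = sorted([atoms[a], atoms[b]])
        pairs.append((first, second, d))
    return pairs


pairs = atom_pairs(atoms, bonds)
unique_types = set(pairs)
print(len(pairs), len(unique_types))
```

With n = 4 non-hydrogen atoms the enumeration yields n(n−1)/2 = 6 pairs, several of which share the same type; only the distinct types would occupy positions in the bit string.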

[0019] The second type of structural descriptor, atom triangles, or atom triples, has been used by several groups for molecular similarity searching and as search keys for 3D search and docking studies. Triangles of atoms with corresponding interatomic distance information are thought to be the most elemental portions of a pharmacophore. Our atom triangles differ from those previously defined. As an indication of interatomic distance, we consider only the length of the shortest path between each pair of atoms forming the triangle. For example, the triangle formed amongst the carbonyl oxygen and the two terminal methyls of acetone is [O,1,1] (2); [C,1,0] (2); and [C,1,0] (2). All possible triangles are generated and each is properly canonicalized to a unique form and then transformed into a bit string as with atom pairs. Often, depending upon the diversity and size of the data set, it is possible to generate hundreds of thousands to millions of unique atom triples. For a 90,000 compound data set there are more than 2 million possible atom triples.

[0020] A bit string is built computationally whose length equals the number of distinct features, e.g., atom triples, in an initially specified data set. The bit string is initially populated with 0's. Any given 0 is changed to a 1 if a compound being examined has at least one atom triple of the type assigned for that position in the bit string. As multiple compounds are thus examined, a matrix of the type shown in FIG. 4 is created, consisting of 1's and 0's. Such a matrix can grow to extremely large size, with over 2,000,000 descriptors not being uncommon. However, since most of the positions will be 0's, denoting the absence of a descriptor for that compound, the matrix is sparse. A sparse matrix is computationally handled in the present invention by only keeping track of where the 1's are, and imputing the positions of the 0's, thus compressing the bit string and saving an enormous amount of computer memory. The bit string is subsequently decompressed when necessary.
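The idea of keeping only the positions of the 1's can be sketched as below. The function names `compress` and `decompress` are hypothetical, and the on-disk format actually used by the invention is not specified in the text.

```python
def compress(bits):
    """Keep only the positions that hold a 1."""
    return [i for i, b in enumerate(bits) if b == 1]


def decompress(ones, length):
    """Rebuild the full bit string, imputing 0 everywhere else."""
    bits = [0] * length
    for i in ones:
        bits[i] = 1
    return bits


row = [0, 0, 1, 0, 0, 0, 1, 0]
ones = compress(row)
print(ones)
print(decompress(ones, len(row)) == row)
```

For a row with a few dozen 1's among millions of positions, the stored list is correspondingly short, which is where the memory saving comes from.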

[0021] In the meantime, an empirically obtained database of the potency (for some chemical or pharmacological reaction of interest) of each of the compounds or mixtures being examined has been assembled. Taking the data consisting of the assembled 1's and 0's in the matrix and the known potency for each compound, the task is to divide the data into two groups, with data objects with 1's assigned to one group and data objects with 0's assigned to the other, thus effectively splitting the data into less active and more active compounds.

[0022] The best column to use to divide the data set must be found. This optimal column is found through the use of the tool known as recursive partitioning (RP). RP analysis generates a diagram as exemplified in FIG. 5. In the diagram in FIG. 5, the node at the top of the tree is designated as Node 0. It represents a population or set of 1650 compounds, some of which are active, but many of which are inactive, whose potency was previously determined (active compounds are assigned a score of 1, 2 or 3, while inactive compounds are assigned a score of 0), and as a group is now said to have an average potency of 0.34. In general, the number of screened compounds needed to build an analysis tree of this type is at least 100, with 200 or more being preferred and 1,000 or more being most preferred. Immediately under Node 0 is a description of an atom triple, C(1,2)-8-; C(2,1)-6-; and C(1,0)-5-. The RP algorithm examines the difference in potency between groups where each triple (or any other descriptor) is present or absent. The RP algorithm has identified this triple as being the best atom triple to partition off active compounds from inactive compounds in the group of 1650, since this triple results in the largest possible difference in average potency between all possible presence/absence pairs, that is, the difference with the smallest p-value using a statistical test. The algorithm has here split off 37 compounds having this triple, and 37 is the number that appears in the next lower node to the right of Node 0 (all compounds not having this triple are split off to the left). These 37 compounds have an average potency of 2.8, out of a maximum possible of 3. Thus, the algorithm has already identified an atom triple that is a chemical structure feature tending to confer a high degree of chemical reactivity on this class of compounds, and a structure-activity relationship begins to emerge. The RP algorithm next identifies the atom triple C(1,2)-4-; N(3,0)-2-; C(2,0)-3- as being the next best atom triple to partition off active compounds from inactive compounds in the remaining group of 37. This round of partitioning results in two compounds lacking the triple being split off to the left, and the remaining 35 compounds being split off to the right. The two compounds split off to the left have no activity, while the other 35 compounds have an average activity of 2.94 out of a possible 3, as stated in the lowermost right side node (called a terminal node). Now a structure-activity relationship is seen in which the presence of the two defined triples reflects a high degree of average potency in the compound subgroup. A typical molecular structure bearing these two atom triples is given, and it can be said with relative confidence that molecules having this general structural core will be active in the screen of interest here (atoms marked with circles are those that belong to the defining atom triples for that node).
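The search for the best splitting column at a single node might look like the sketch below. It ranks columns by the absolute value of a pooled two-sample t-statistic, which is one plausible reading of "the difference with the smallest p-value"; the p-value computation itself is omitted, and the toy matrix, potencies and function names are invented for illustration.

```python
import math


def t_statistic(xs, ys):
    """Pooled two-sample t-statistic for the difference in group means."""
    m, n = len(xs), len(ys)
    mean_x, mean_y = sum(xs) / m, sum(ys) / n
    ss = sum((x - mean_x) ** 2 for x in xs) + sum((y - mean_y) ** 2 for y in ys)
    pooled = math.sqrt(ss / (m + n - 2))
    if pooled == 0.0:
        # Both groups are constant; the split is either perfect or useless.
        return 0.0 if mean_x == mean_y else math.copysign(math.inf, mean_x - mean_y)
    return (mean_x - mean_y) / (pooled * math.sqrt(1 / m + 1 / n))


def best_split(matrix, potencies):
    """Return (column, |t|) for the descriptor that best separates potencies."""
    best = None
    if len(potencies) < 3:
        return best
    for col in range(len(matrix[0])):
        present = [p for row, p in zip(matrix, potencies) if row[col] == 1]
        absent = [p for row, p in zip(matrix, potencies) if row[col] == 0]
        if not present or not absent:
            continue  # a column that does not divide the node cannot split it
        t = abs(t_statistic(present, absent))
        if best is None or t > best[1]:
            best = (col, t)
    return best


# Toy data: six compounds, three binary descriptors, potency scores 0 to 3.
matrix = [[1, 1, 0], [0, 1, 0], [1, 1, 1], [0, 0, 1], [1, 0, 0], [0, 0, 1]]
potencies = [3, 2, 3, 0, 0, 1]
print(best_split(matrix, potencies))
```

In this toy set the second descriptor (column index 1) separates the high-potency compounds from the rest, so it is selected as the splitting variable.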

[0023] However, it can be seen that two other good terminal nodes showed up in this analysis, resulting in three chemical structure classes being generated in FIG. 5. When the first round of partitioning took place, the algorithm took the remainder of 1613 compounds and identified an atom triple tending to confer activity within that group, C(3,0)-2-; N(1,2)-2-; N(1,2)-3-, and partitioned that subgroup accordingly into two subgroups having average potencies of 2.3 and 0.23, reflecting the presence or absence of that atom triple. The partitioning process continued until terminal nodes were reached, yielding three structure-activity relationships. These three structural cores can be seen to have somewhat different chemistries. Thus, the original activity of the group of 1650 may be the result of different biochemical/chemical mechanisms. RP can deal with such mixtures of compounds that follow different mechanistic paths.

[0024] Having developed such a tree, it is then possible to predict the activities of compounds that have not yet been empirically tested for activity. A given compound is analyzed for presence or absence of triples, or whatever the descriptor is that has been chosen, and then cascaded down the tree with the help of a software tool that is part of the present invention, which is designated as Pachinko. Having examined the compound for the presence or absence of those descriptors now known to confer activity, the activity of the compound is electronically predicted, eliminating the need for high throughput screening of large numbers of compounds which will not have a desired threshold of activity. Those compounds having the greatest predicted activity are selectively tested, at great cost and time savings.
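Cascading an untested compound down a finished tree can be sketched as follows. The nested-dictionary tree layout, the `predict` function and the labels `triple_A`, `triple_B` and `triple_C` are hypothetical stand-ins; the tree is a simplified version of the one described for FIG. 5, reusing the average potencies quoted in the text, and it is not the actual Pachinko tool.

```python
# Simplified stand-in for the FIG. 5 tree; descriptor labels are placeholders.
tree = {
    "descriptor": "triple_A",
    "present": {
        "descriptor": "triple_B",
        "present": {"mean_potency": 2.94},
        "absent": {"mean_potency": 0.0},
    },
    "absent": {
        "descriptor": "triple_C",
        "present": {"mean_potency": 2.3},
        "absent": {"mean_potency": 0.23},
    },
}


def predict(node, features):
    """Follow present/absent branches until a terminal node is reached."""
    while "descriptor" in node:
        branch = "present" if node["descriptor"] in features else "absent"
        node = node[branch]
    return node["mean_potency"]


print(predict(tree, {"triple_A", "triple_B"}))
print(predict(tree, {"triple_C"}))
```

A compound's predicted activity is simply the average potency of the terminal node it lands in, so compounds landing in low-potency nodes can be deprioritized before any assay is run.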

[0025] It is important to understand that not only discrete compounds or individuals can be assigned to and passed through nodes in the analysis tree, but mixtures themselves as well. Thus, a situation involving 1,000 pools, each containing 10 different compounds, isomers, conformers, etc., can be analyzed, with each pool defined and analyzed in terms of the descriptors present in it. Broadly speaking, discrete compounds or individuals are data objects (an object that itself is not a mixture), but such pools are themselves also each a data object, which we refer to as a mixture object for greater clarity (i.e. an object that is itself a mixture). Whether an object is a data object or a mixture object, the object is analyzed in the same fashion using bit string assembly and recursive partitioning.

[0026] Situations commonly arise in which multiple binding modes exist by which several given compounds may be showing the same biological potency, but are doing so by binding to different available binding sites on a receptor molecule, a common situation in pharmacology. A related problem is that of a cell that presents more than one receptor site such that structurally differing molecules can elicit the same biological response from the cell. These problems are increased by orders of magnitude when combinatorial testing is carried out. The problem here is in figuring out what different structural features out of such a mix can confer activity and applying that knowledge to the design or screening of new compounds. The present invention can resolve such mixture problems by assembling a set of descriptors that can define a population of compounds and then proceeding with the rest of the analysis as described to arrive at structure-activity relationship rules out of the mixture.

[0027] Yet another problem that can be addressed by the present invention is that in which pairs of compounds may be acting synergistically to elicit a chemical or pharmacological response, and where a plurality of pairs is present in a pool to be analyzed. The method of the present invention can be used to find such pairs in a pool and quantify their relative activity as synergistic pairs. As set forth above, not only discrete compounds can be analyzed as data objects but also mixtures as mixture objects. Thus, where no individual compounds (objects) decode into a node, but one or more pairs of compounds (mixture objects) decode into the same node that shows a high average potency, then this result implies the discovery of a synergistic pair of compounds, with members of the pair having the characteristics of the descriptors leading to that node. Synergistic triples, etc., of compounds can be found in like manner.

[0028] In genetics, it is common for a population to have individuals in it that are of different genotypes. It is now known that a great many diseases are controlled by not one, but multiple genes in an individual. These two factors present a huge problem in unraveling how to rationally target a drug therapy at a population of patients who may have the same clinical diagnosis, but whose pathology is being controlled by multiple possibly different genes within each patient. Until now, there has been no known satisfactory method for the identification of multiple interacting genes from large genomic data sets. However, the present invention addresses this by using alleles or combinations of alleles and/or gene markers as descriptors. Thus, as shown in FIG. 6, a patient population of 1293 individuals had an average disease incidence of 0.61. The RP algorithm selects the gene marker aaxxx, present with two copies, to do a partition. This results in a subgroup of 86 individuals being split off to the right, 83% of whom had disease, while a subgroup of 1,207 not having that genetic marker is split off to the left, having a disease incidence of 59%. The analysis is continued until terminal nodes are reached that lead to the prediction that the highest incidence of disease will occur in those individuals having two copies of the aaxxx gene but who do not have the gene dbbfyy, which thus appears to be linked to a protector gene that tends to confer protection from disease on an individual, since those that had the putative protector gene only had a 30% incidence of disease. Using these results, after obtaining a genetic analysis of an individual's DNA, their chances of becoming a disease victim can be predicted, and their therapy can be tailored accordingly if the drug being used is one which acts upon a protein expression product of one or more of the gene markers used as descriptors or a nearby gene.

[0029] Since the economics of high throughput screening favor screening mixtures of compounds, the questions then arise of how to analyze such pooled data, and how to pool them. In another preferred embodiment of the invention, RP can be used to analyze such pooled data.

[0030] Discrete products of a combinatorial synthesis can be encoded and decoded by use of the present invention, since each vector as described above is an identifier of the features of a compound. A given compound from a combinatorial synthesis (especially a virtual synthesis, see U.S. Pat. No. 5,463,564) is electronically dropped down an analysis tree and if it lands in a given terminal node showing high activity, it is now known to have a high probability of activity by virtue of all descriptors assigned to each node through which it passed successfully. This eliminates screening and identification of the great majority of compounds in a virtual combinatorial library, as it is well known that the great majority of combinatorial discretes are chemical ‘junk’ that will not have any appreciable biological activity, but still have to be winnowed out of a combinatorial pool, currently at great wasted expense.

[0031] SCAM was the software tool developed as part of the present invention to perform recursive partitioning by swiftly computing binary splits on a large number of descriptor variables. There are several aspects of implementation to consider. Huge sparse matrices, tens of thousands of structures and millions of descriptors have to be handled, efficient binary splits on up to a million or more variables have to be routinely performed, and a useful bridge for the chemist between the statistical analysis and the actual structures has to be devised.

[0032] Three files are produced prior to a SCAM analysis: (1) a data file containing the compound names and potencies; (2) a descriptor dictionary file containing a contextual decoding of each descriptor variable; and (3) a binary file containing a record for each structure that lists all computed descriptors. To conserve memory, a sparse storage format is employed in which each descriptor is stored with a list of the structures in which the descriptor is found. This is very similar to the concept of indirect keys used in substructure search. An alternative is to store a list of descriptors that are found in each structure. However, the former is more efficient, since the t-test is performed on the activities of the structures associated with a particular descriptor.
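The descriptor-to-structures layout can be sketched as an inverted index. The structure names, descriptor labels and the `build_index` helper below are hypothetical; the sketch only illustrates why this orientation suits the t-test, since the potencies of the group that has a descriptor can be gathered with a single lookup.

```python
from collections import defaultdict


def build_index(structure_descriptors):
    """Map each descriptor to the list of structures that contain it."""
    index = defaultdict(list)
    for structure, descriptors in structure_descriptors.items():
        for d in sorted(set(descriptors)):
            index[d].append(structure)
    return index


# Hypothetical per-structure descriptor lists and potencies.
per_structure = {
    "cmpd_1": ["pair_a", "pair_b"],
    "cmpd_2": ["pair_b"],
    "cmpd_3": ["pair_a", "pair_c"],
}
potency = {"cmpd_1": 3, "cmpd_2": 0, "cmpd_3": 2}

index = build_index(per_structure)
group_with_a = [potency[s] for s in index["pair_a"]]
print(index["pair_a"], group_with_a)
```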

[0033] In contrast to data partitioning via continuous descriptor variables, binary classification trees can be computed very quickly and efficiently since there are far fewer and much simpler computations involved. For example, FIRM develops rules for splitting based on “binning” of continuous variables and amalgamating contiguous groups of variables. These processes add considerably to execution time and effectively limit the interactive nature of most general RP packages for large data sets. However, with binary data a parent node can only be split into two and only two daughter nodes. Splitting on a binary descriptor such as the presence or absence of an atom pair involves performing a t-test between the mean of the group that has the atom pair and the group that does not. The t-values for each rule as a potential split can then be compared using the largest t-statistic. The atom pair with the largest t-statistic is the splitting variable. Therefore, the p-value (a time-consuming part of the calculation) needs only to be computed for the most significant split. Adding to the speed is the fact that, frequently, either the group that has the atom pair or the group that does not have the atom pair is quite small. This fact can be exploited using an idea known as “updating” which can be applied to a well known expression for computing the sample variance. If one denotes the potencies in group 1 by $x_1, x_2, \ldots, x_m$ and group 2 by $y_1, y_2, \ldots, y_n$ and assuming that group 1 is smaller than group 2 (m<n), the t-statistic for testing for a difference between group potency means is:

$T = \frac{\frac{\bar{x} - \bar{y}}{\sqrt{\frac{1}{m} + \frac{1}{n}}}}{\sqrt{\frac{SSX + SSY}{n + m - 2}}}$

where

$SSX = \sum_{i=1}^{m} (x_i - \bar{x})^2, \quad \bar{x} = SX/m, \quad SX = \sum_{i=1}^{m} x_i$

$SSY = \sum_{i=1}^{n} (y_i - \bar{y})^2, \quad \bar{y} = SY/n, \quad SY = \sum_{i=1}^{n} y_i$

[0034] Next, let $z_1, z_2, \ldots, z_{m+n}$ denote the potencies in the parent node. The sum, SZ, was computed for the previous split so it is available. Therefore, after computing SX, SY can be computed as the difference SY=SZ−SX. This technique is known as “updating”.

[0035] A similar updating method can be used to compute SSX and SSY. Note that:

$SSX = \sum_{i=1}^{m} x_i^2 - m\bar{x}^2$

$SSY = \sum_{i=1}^{n} y_i^2 - n\bar{y}^2$

[0036] so SSY can be computed using the sum of the data, SY, and the sum of the squared data which will be denoted by SYY. Having computed SXX, and having SZZ available, SYY can be computed by the relation SYY=SZZ−SXX. Therefore, the t-statistic can be computed very quickly, having stored the sum of the data and the sum of the squared data from the previous split.
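The updating idea can be sketched as below: only the smaller group is summed directly, and the larger group's sums follow by subtraction from the parent node's stored totals. The function name and argument order are assumptions for this example; the sketch uses the identity that the sum of squared deviations equals the sum of squares minus the squared sum divided by the group size.

```python
import math


def t_from_sums(m, sx, sxx, total, sz, szz):
    """t-statistic from group-1 sums (SX, SXX) and parent-node sums (SZ, SZZ).

    m is the size of group 1 and total is the size of the parent node.
    """
    n = total - m
    sy = sz - sx        # SY  = SZ  - SX
    syy = szz - sxx     # SYY = SZZ - SXX
    ssx = sxx - sx * sx / m   # equals SXX - m * xbar**2
    ssy = syy - sy * sy / n   # equals SYY - n * ybar**2
    pooled = math.sqrt((ssx + ssy) / (m + n - 2))
    return (sx / m - sy / n) / (pooled * math.sqrt(1 / m + 1 / n))


# Toy node: group 1 has the descriptor, group 2 does not.
xs, ys = [3, 2, 3], [0, 0, 1]
zs = xs + ys
t = t_from_sums(
    m=len(xs),
    sx=sum(xs),
    sxx=sum(x * x for x in xs),
    total=len(zs),
    sz=sum(zs),
    szz=sum(z * z for z in zs),
)
print(t)
```

In practice SZ and SZZ would be carried over from the previous split rather than recomputed, so the cost of evaluating a candidate descriptor depends only on the size of the smaller group.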

[0037] The partitioning is repeated until a stopping criterion is met. Firstly, the process can stop if there is no statistical test (t-test is preferred) that achieves a specified level of statistical significance. Secondly, the process can stop if the mixtures in a node are homogeneous with respect to their measured property. Thirdly, the process can stop if the size of each terminal node is below a user specified value.
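A compact sketch of the recursion with the three stopping rules follows. It substitutes a fixed threshold on the absolute t-statistic (`min_t`) for a proper significance level, which is a simplification assumed here; `min_size`, the dictionary layout of the tree and the toy data are likewise illustrative choices rather than details from the disclosure.

```python
import math


def t_stat(xs, ys):
    """Pooled two-sample t-statistic for the difference in group means."""
    m, n = len(xs), len(ys)
    mean_x, mean_y = sum(xs) / m, sum(ys) / n
    ss = sum((x - mean_x) ** 2 for x in xs) + sum((y - mean_y) ** 2 for y in ys)
    pooled = math.sqrt(ss / (m + n - 2))
    if pooled == 0.0:
        return 0.0 if mean_x == mean_y else math.copysign(math.inf, mean_x - mean_y)
    return (mean_x - mean_y) / (pooled * math.sqrt(1 / m + 1 / n))


def partition(rows, potencies, min_t=2.0, min_size=4):
    """Recursively split on the best binary descriptor until a stop rule fires."""
    node = {"n": len(potencies), "mean_potency": sum(potencies) / len(potencies)}
    # Stop: node too small, or homogeneous in the measured property.
    if len(potencies) < max(min_size, 3) or len(set(potencies)) == 1:
        return node
    best_col, best_t = None, min_t
    for col in range(len(rows[0])):
        present = [p for r, p in zip(rows, potencies) if r[col] == 1]
        absent = [p for r, p in zip(rows, potencies) if r[col] == 0]
        if present and absent:
            t = abs(t_stat(present, absent))
            if t > best_t:
                best_col, best_t = col, t
    # Stop: no candidate split clears the threshold.
    if best_col is None:
        return node
    node["descriptor"] = best_col
    for label, bit in (("present", 1), ("absent", 0)):
        keep = [i for i, r in enumerate(rows) if r[best_col] == bit]
        node[label] = partition(
            [rows[i] for i in keep], [potencies[i] for i in keep], min_t, min_size
        )
    return node


rows = [[1, 1, 0], [0, 1, 0], [1, 1, 1], [0, 0, 1], [1, 0, 0], [0, 0, 1]]
potencies = [3, 2, 3, 0, 0, 1]
tree = partition(rows, potencies)
print(tree["descriptor"], tree["present"]["mean_potency"], tree["absent"]["mean_potency"])
```

With these settings the toy node is split once on the second descriptor, and both daughters become terminal nodes because they fall below the minimum size.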

EXAMPLE ANALYSIS

[0038] Use of RP to uncover substructural rules that govern the biological activity of a set of 1,650 monoamine oxidase inhibitors (MAOI's).

[0039] A series of 1,650 MAOI's was used to illustrate the effectiveness of SCAM in analyzing large structure-activity datasets and producing SAR rules. Neuronal monoamine oxidase [amine:oxygen oxidoreductase (deaminating) E.C. 1.4.3.4] inactivates neurotransmitters such as norepinephrine by converting the amino group to an aldehyde. Inhibitors of this enzyme are thought to be useful in the treatment of depression and were introduced into therapy in 1957 with the drug pargyline. However, due to toxicity concerns and interactions with other drugs and food, they are now only occasionally used. Yet, there is continued interest among pharmaceutical researchers in MAO as a target for rational drug design in anti-depressant therapy. Biological activities were reported in four classes of MAOI's: 0, inactive; 1, somewhat active; 2, modestly active; and 3, most active. Generating any type of QSAR from this dataset would previously have been considered by those of skill in the art to be quite difficult, but with the present invention, statistically determining SAR rules is now possible and relatively easy.

[0040] Recursive partitioning was applied to this set of 1,650 activities and unique atom pairs, and the resulting tree diagram is shown in FIG. 1. Default settings were used to produce this tree: up to 10 levels of partitioning are allowed, each split is statistically significant (Bonferroni-adjusted p-value < 0.01), and both positive and negative splits are allowed. The Bonferroni p-value is computed by multiplying the raw p-value by the number of variables examined at the node. Eleven significant splits were found, yet a high percentage, 79.5% (70/88), of the most active molecules are found in only 3 terminal nodes (shaded in gray).
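The Bonferroni adjustment described above amounts to a single multiplication. The following Python sketch is illustrative only; the function name and the numerical values are hypothetical, and capping the result at 1.0 is an assumption added here so that the adjusted value remains a valid probability.

```python
def bonferroni_adjust(raw_p, n_variables_examined):
    """Multiply the raw p-value by the number of variables examined at the node."""
    # Assumption: cap at 1.0 so the adjusted value stays a valid probability.
    return min(1.0, raw_p * n_variables_examined)


raw_p = 2e-6          # illustrative raw p-value of the best split at a node
n_examined = 1000     # illustrative number of descriptors examined at that node

adjusted = bonferroni_adjust(raw_p, n_examined)
is_significant = adjusted < 0.01      # default significance threshold stated above
```

Because the multiplier is the number of variables actually examined at the node, the same raw p-value can be significant at a small node and not significant at a node where many more descriptors vary.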

[0041] To facilitate the understanding of the splits of the data obtained from recursive partitioning, it was necessary to have a molecular viewer which could not only display molecules, but also highlight the portions of the molecules described in the rules. SCAM is not locked into displaying only one type of descriptor. Instead, it passes the path of descriptor variables leading to a node to an external program, which highlights the appropriate atoms or bonds and then passes the structure along to a viewer. To SCAM, descriptors are just strings, and it is up to external programs to interpret the results and display them. The external programs are selected simply by setting environment variables.

[0042] SCAM has an option that allows the user to enter an MDL SD-file containing the structures for the compounds. Rather than reading the structures directly into memory, as the files can be very large, a list of seek indices is computed once on the SD-file. Then, whenever the user requests to see the compounds at a node, it is a simple matter of performing seeks to the appropriate offsets in the SD-file to obtain the compounds of interest.
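The seek-index idea can be sketched as follows. This is a minimal Python illustration, not the SCAM implementation (which is written in C); it relies on the SD-file convention that each record ends with a line containing only `$$$$`, and the function names and the toy data are hypothetical.

```python
import io


def build_seek_index(handle):
    """Scan a binary SD-file handle once and return the byte offset
    at which each complete record starts."""
    offsets = []
    start = handle.tell()
    while True:
        line = handle.readline()
        if not line:
            break
        if line.strip() == b"$$$$":      # record delimiter
            offsets.append(start)
            start = handle.tell()        # next record begins here
    return offsets


def read_record(handle, offsets, i):
    """Seek to record i and read it up to and including its delimiter."""
    handle.seek(offsets[i])
    lines = []
    for line in iter(handle.readline, b""):
        lines.append(line)
        if line.strip() == b"$$$$":
            break
    return b"".join(lines)


# Toy stand-in for an SD-file; real records contain full connection tables.
sd_file = io.BytesIO(b"mol A\nline\n$$$$\nmol B\n$$$$\n")
offsets = build_seek_index(sd_file)
second = read_record(sd_file, offsets, 1)
```

Only the list of offsets is held in memory, so the cost of showing the compounds at a node is proportional to the number of compounds in that node rather than to the size of the file.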

[0043] When examining the RP classification tree, it is often of great interest to see the distribution of potencies at a node and to see how a split at a node divides up the potencies at the two daughter nodes. A non-parametric density plot is available to display the potency distribution at the node, with the potency distributions of the two daughter nodes overlaid in different colors. The density plot is produced by weighting each point by a Gaussian kernel function with a configurable bandwidth. If the assay variability is known, then the assay standard deviation can be used for the bandwidth.
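A Gaussian kernel density estimate of this kind can be sketched as below. The Python code is an illustration under stated assumptions: the function name, the potency values, the grid, and the bandwidth of 0.5 are all hypothetical, and plotting is left out so that the example needs only the standard library.

```python
import math


def gaussian_density(potencies, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate at each grid point."""
    norm = 1.0 / (len(potencies) * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in potencies)
        for x in grid
    ]


# Illustrative potencies at a node and at its two daughter nodes.
parent = [0.0, 0.0, 1.0, 2.0, 3.0, 3.0]
daughters = {"absent": [0.0, 0.0, 1.0], "present": [2.0, 3.0, 3.0]}

step = 0.01
grid = [-5.0 + step * i for i in range(1301)]   # potency axis from -5 to 8
bandwidth = 0.5                                 # e.g. the assay standard deviation

parent_curve = gaussian_density(parent, grid, bandwidth)
daughter_curves = {
    name: gaussian_density(values, grid, bandwidth)
    for name, values in daughters.items()
}
```

Each curve integrates to one over the grid, so the parent and daughter curves can be overlaid on a common axis to show how the split separates the potencies.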

[0044] AT Tree

[0045] Once the analysis has been completed, a file describing the rules that create an RP tree can be written to disk, and a utility program, Pachinko, can be invoked on a new dataset to find where the compounds in that dataset would fall in the classification tree. Thus, a set of compounds can be screened and analyzed with SCAM to produce a classification tree, and then a whole corporate chemical compound collection, or even virtual chemical compound libraries, can be dropped down the tree to suggest additional compounds for biological screening. With Pachinko it is also possible to divide data into training and validation datasets to test the predictive power of the tree.

[0046] With a large number of descriptor variables, it is often the case that more than one descriptor would give rise to the same split at a node. These variables are considered to be perfectly correlated. When the variable associated with the most significant split has other perfectly correlated variables, all such descriptors at the node are stored so that these rules can later be used as input to the Pachinko program. In the dataset used to create the tree, all correlated variables will be found within the structures at a right node, though, in theory, only one would be necessary in order for some novel structure to be placed there. Within the Pachinko program, there is an option either to require all correlated variables to match for a rule to be satisfied, or else to take the right path in the tree when any one descriptor matches.
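The two matching options can be expressed compactly. The Python sketch below is illustrative only; the function name, the `require_all` flag, and the descriptor strings are hypothetical and do not reflect the actual Pachinko option names or descriptor encoding.

```python
def takes_right_branch(node_descriptors, compound_descriptors, require_all=True):
    """Decide whether a compound follows the right (feature present) path.

    node_descriptors: the perfectly correlated descriptors stored for the split.
    compound_descriptors: the set of descriptors found in the new compound.
    """
    present = [d in compound_descriptors for d in node_descriptors]
    return all(present) if require_all else any(present)


# Made-up descriptor strings standing in for perfectly correlated variables.
node_rule = ["descriptor_17", "descriptor_42"]
novel_compound = {"descriptor_17", "descriptor_99"}

strict = takes_right_branch(node_rule, novel_compound, require_all=True)
lenient = takes_right_branch(node_rule, novel_compound, require_all=False)
```

For compounds from the training dataset the two options agree by construction; they can differ only for novel structures that carry some, but not all, of the correlated descriptors.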

[0047] There is now set forth a pseudocode example for carrying out the SCAM function. SCAM is implemented in C code using the XVT Development Solution for C, a tool for building Graphical User Interfaces in C. SCAM is menu-driven.

[0048] 1. File Menu

[0049] File commands are used to import the data files associated with SCAM, enter documentation, and send print output to a file.

[0050] 1.1 Import

[0051] read the .dat file and store compound names and potencies in arrays;

[0052] read the .des file and store descriptor codes and names in arrays;

[0053] read the .bit file and create a matrix which has a row for each descriptor and, in each row, an array of indices (into the compounds array) of all compounds that have that descriptor;

[0054] 1.2 Read Structures

[0055] calculate a set of seek indices into an SD file so that molecular structure information can be accessed quickly;

[0056] 1.3 Edit Information Box

[0057] allow the user to input information about the data set being analyzed;

[0058] 1.4 Print Tree

[0059] write the current tree to a postscript file for later printing;

[0060] 1.5 Quit

[0061] quit SCAM;

[0062] 2. Tree Menu

[0063] Most of the options in the tree menu operate on the currently active node, which the user indicates by positioning the cursor over a node and clicking the left mouse button.

[0064] 2.1 Split Node

[0065] split the active node into two child nodes using the descriptor which provides the most statistically significant split;

[0066] bonferroni:=number of descriptors;

[0067] tbest:=0; /*holds the t-statistic for the best split*/

[0068] for every descriptor in the data set do {
    split the compounds in the active node into two groups according to whether or not they have the descriptor;
    if the descriptor appears in no or all compounds then
        bonferroni := bonferroni − 1;
    else {
        calculate the t-statistic for this split:
            t = (x_(L) − x_(R)) / (σ · sqrt(1/n_(L) + 1/n_(R)))
        where:
            x_(L), x_(R) = mean potency of compounds in left or right child
            σ = standard deviation of compound potencies of node being split
            n_(L), n_(R) = number of compounds in left or right child
        if |t| > |tbest| then /* largest t-statistic indicates the most significant split */
            tbest := t
    }
}

[0069] compute the p-value from tbest and multiply this by the bonferroni adjustment to get a value indicating the significance of the split;

[0070] 2.2 Delete Subtree

[0071] delete the subtree rooted at the currently active node;

[0072] 2.3 Split Subtree Recursively

[0073] while (tree depth from active node<maximum-depth AND

[0074] further splits can be found) do

[0075] split a terminal node of the tree rooted at the currently active node

[0076] 2.4 View Structures

[0077] filter an SD file containing the compounds in the active node through an external program which highlights the atoms in the compounds that correspond to the descriptor variables (including correlated ones) that got the compound to that node;

[0078] send the filtered SD file to a viewer program (Project View);

[0079] 2.5 Structures→Clipboard

[0080] copy the structures at the active node to the clipboard in the form of an SD file;

[0081] 2.6 Save Structures

[0082] write all structures (with atom highlighting; see Section 2.4) within the active node to a file;

[0083] 2.7 List Node

[0084] write a list of the compounds and potencies within the active node to an external file;

[0085] 2.8 Node Potency Histogram

[0086] draw a non-parametric density plot of the potencies of the active node;

[0087] 2.9 Write Pachinko Subtree Rules

[0088] write the rules that generated the tree rooted at the active node to an external file;

[0089] 2.10 Create .dat File for Node

[0090] create a .dat file for the compounds in the active node;

[0091] 2.11 Options

[0092] review and/or alter the options (split method, minimum split size, split significance, maximum tree depth, potency thresholds for highlighting) that determine how nodes are split and how the tree is displayed;

[0093] Copyright 1997 by Glaxo Wellcome, Inc., all rights reserved, except as stated above.
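The Split Node step of Section 2.1 can be sketched in Python as follows. This is an illustration, not the SCAM C source. The names are hypothetical, `descriptor_rows` mirrors the matrix built by the Import step (one row per descriptor, holding the indices of the compounds that have it), and the two-group form of t used here is an assumption. Converting the winning t-statistic to a p-value is left out.

```python
import math


def best_split(potencies, descriptor_rows, node):
    """Return (best descriptor index, its t-statistic, Bonferroni count).

    potencies: potency of every compound, indexed by compound number.
    descriptor_rows: for each descriptor, the compound indices that have it.
    node: compound indices in the active node (at least two).
    """
    n = len(node)
    mean = sum(potencies[i] for i in node) / n
    # Standard deviation of the node being split, as in the pseudocode above.
    sigma = math.sqrt(sum((potencies[i] - mean) ** 2 for i in node) / (n - 1))
    if sigma == 0.0:
        return None, 0.0, 0              # homogeneous node: nothing to split
    members = set(node)
    bonferroni = len(descriptor_rows)
    tbest, best = 0.0, None
    for d, row in enumerate(descriptor_rows):
        right = [i for i in row if i in members]     # compounds with the descriptor
        n_right, n_left = len(right), n - len(right)
        if n_right == 0 or n_left == 0:
            bonferroni -= 1              # descriptor cannot split this node
            continue
        sum_right = sum(potencies[i] for i in right)
        mean_right = sum_right / n_right
        mean_left = (mean * n - sum_right) / n_left  # by subtraction from the parent
        t = (mean_right - mean_left) / (sigma * math.sqrt(1 / n_right + 1 / n_left))
        if abs(t) > abs(tbest):          # largest |t| is the most significant split
            tbest, best = t, d
    return best, tbest, bonferroni


# Illustrative data: descriptor 0 separates actives, 1 does not, 2 is in all compounds.
potencies = [0.0, 0.0, 0.0, 3.0, 3.0, 3.0]
descriptor_rows = [[3, 4, 5], [0, 3], [0, 1, 2, 3, 4, 5]]
best, tbest, bonferroni = best_split(potencies, descriptor_rows, list(range(6)))
```

The returned Bonferroni count is the number of descriptors that actually vary within the node, which is the multiplier applied to the raw p-value of the chosen split.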

[0094] There is now set forth a pseudocode example by which Pachinko predicts the activity of a molecule when rules from SCAM/Recursive Partitioning have been previously stored.

[0095] For each rule used to split data;

[0096] input Node Tree Position;

[0097] input Node Average;

[0098] input Node Number Rules;

[0099] input Node Rule Set;

[0100] For each object to be predicted

[0101] Current Tree Position:=“N”;

[0102] Object Activity:=Node Average at Current Tree Position;

[0103] Input Object Name;

[0104] Input Object Rule Set;

[0105] While Node Number Rules at Current Tree Position is greater than 0

[0106] for every rule r_(i) in Node Rule Set at Current Tree Position

[0107] if r_(i) is not an element of Object Rule Set at Current Tree Position

[0108] Current Tree Position:=Current Tree Position +“0”;

[0109] next Rule Set;

[0110] Current Tree Position:=Current Tree Position +“1”;

[0111] Object Activity:=Node Average at Current Tree Position;

[0112] print Object Name, Object Activity;

[0113] Copyright 1997, 1998 by Glaxo Wellcome, Inc., all rights reserved except as stated above.
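The Pachinko traversal above can be sketched in Python as follows. This is an illustration only: the dictionary layout, the function name, and the rule strings are hypothetical. As in the pseudocode, a node's position is the string "N" followed by "0" (rule set not satisfied) or "1" (rule set satisfied) for each split passed, and every rule in a node's rule set must be present in the object for the "1" path to be taken.

```python
def predict(nodes, object_rules):
    """Drop one object down the stored tree.

    nodes: maps a tree position string to its node average and rule set;
           terminal nodes have an empty rule set.
    object_rules: the set of descriptors present in the object.
    Returns the terminal position reached and the predicted activity.
    """
    position = "N"
    activity = nodes[position]["average"]
    while nodes[position]["rules"]:
        if all(rule in object_rules for rule in nodes[position]["rules"]):
            position += "1"      # every rule matched
        else:
            position += "0"      # at least one rule missing
        activity = nodes[position]["average"]
    return position, activity


# Illustrative stored tree with made-up rules and node averages.
nodes = {
    "N":   {"average": 1.0, "rules": ["A"]},
    "N0":  {"average": 0.2, "rules": []},
    "N1":  {"average": 2.1, "rules": ["B", "C"]},
    "N10": {"average": 0.9, "rules": []},
    "N11": {"average": 2.8, "rules": []},
}

full_match = predict(nodes, {"A", "B", "C"})
partial_match = predict(nodes, {"A", "B"})
no_match = predict(nodes, {"B"})
```

Keying the nodes by their position string means that no explicit tree structure has to be rebuilt from the stored rules; each step is a single dictionary lookup.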

What is claimed is:
 1. A computer-based method of encoding features of data objects, and of identifying and correlating individual said features to a response characteristic that is a trait of interest of the data object, applicable to data objects in a data set that is characterized in being a mixture of data object classes, each data object class containing one or more of said data objects, and wherein multiple data objects present a same or similar value of the trait of interest, but classes of data objects produce the response characteristic that is a trait of interest through different underlying mechanisms, comprising the steps of: (a) assembling a set of descriptors and converting said set of descriptors into the form of a bit string such that each descriptor reflects the presence or absence of any given potentially useful feature of interest in a data object of interest; (b) examining each data object for presence or absence of each of said descriptors; (c) assembling the results of step (b) into a vector for each data object, noting the presence or absence of each feature of interest in said data object; (d) assembling all vectors generated in step (c) into a matrix with each row of the matrix corresponding to a data object and each column corresponding to a feature of interest; (e) dividing the data in said matrix into two daughter sets on the basis of presence or absence of a given feature of interest from said set of descriptors; and (f) repeating step (e) until each member of said matrix has been identified in terms of presence or absence of any given feature of interest from said set of descriptors and each of said members has been assigned to a terminal node.
 2. A computer-based apparatus system for allowing a user thereof to encode features of data objects, and to identify and correlate individual said features to a response characteristic that is a trait of interest of the data object, applicable to data objects in a data set that is characterized in being a mixture of data object classes, each data object class containing one or more of said data objects, and wherein multiple data objects present a same or similar trait of interest, but classes of data objects produce the response characteristic that is a trait of interest through different underlying mechanisms, comprising: (a) input means responsive to operator commands enabling an operator to specify a set of descriptors that are subsequently converted into a bit-string, such that each descriptor reflects the presence or absence of a potentially useful feature of interest in a data object of interest; (b) storage means for storing the assembled set of (a); (c) memory means for executing programmed steps that examine each data object for presence or absence of each of said descriptors; (d) means for assembling the results of (c) into a virtual matrix with each row of the matrix corresponding to an object and each column corresponding to a feature of interest; (e) means for assigning each data object in said matrix recursively into one of two defined categories on the basis of presence or absence of a given feature of interest from said set of descriptors and repeating such analysis until each member of said mixture has been identified in terms of presence or absence of features of interest from said set of descriptors and assigned to a terminal node; and (f) output means for visually displaying, using computer graphics, a relationship of said descriptors with said data objects and classes.
 3. A computer software system having a set of instructions for controlling a general purpose digital computer in performing a desired function comprising: a set of instructions formed into each of a plurality of modules, each module comprising: (a) an input process responsive to operator commands enabling an operator to specify a set of descriptors and convert said descriptors into a bit string such that each descriptor reflects the presence or absence of a potentially useful feature of interest of a data object of interest, wherein each data object is a member of a data set that is characterized in being a mixture of data object classes, each data object class containing one or more of said data objects, and wherein multiple data objects present a same or similar trait of interest, but classes of data objects produce the response characteristic that is a trait of interest through different underlying mechanisms; (b) a data storage process for storing the assembled set of (a); (c) a computational process for executing programmed steps that examine each member of said mixture for presence or absence of each of said descriptors; (d) a computational process for assembling the results of (c) into a vector for each data object and a matrix for all vectors; (e) a computational process for assigning each data object in said matrix into one of two defined categories on the basis of presence or absence of a given feature of interest from said set of descriptors and repeating such analysis until each member of said mixture has been identified in terms of presence or absence of each feature of interest from said set of descriptors and assigned to a terminal node; (f) a data storage process; and (g) an output process for visually displaying, using computer graphics, a relationship of said descriptors with said data objects and classes.
 4. A computer-based method of encoding mixture features of planned mixtures or of inadvertent mixtures, or of a combination of planned or inadvertent mixtures, and of identifying and correlating individual said features to a response characteristic of the mixture object, wherein said mixture object is in a data set wherein multiple mixture objects comprising the data set present the same trait of interest through a common underlying mechanism; comprising the steps of: (a) assembling a set of descriptors and converting said set of descriptors into the form of a bit string such that each descriptor reflects the presence or absence of a potentially useful feature of interest in a mixture object; (b) examining each mixture object for presence or absence of each of said descriptors; (c) assembling the results of step (b) into a vector for each mixture object, noting the presence or absence of each feature of interest in said mixture object; (d) assembling all vectors generated in step (c) into a matrix with each row corresponding to a mixture object and each column corresponding to a feature of interest; (e) dividing the mixture objects in said matrix into two defined daughter nodes on the basis of presence or absence of a given feature of interest from said set of descriptors; and (f) repeating step (e) until each mixture object of said matrix has been identified in terms of presence or absence of given features of interest from said set of descriptors and assigned to a terminal node.
 5. A computer-based apparatus system for allowing a user thereof to encode features of planned mixtures or of inadvertent mixtures, or of a combination of planned or inadvertent mixtures, and to identify and correlate individual said features to a response characteristic of the mixture object, wherein said mixture object is in a data set wherein multiple mixture objects comprising the data set present the same trait of interest through a common underlying mechanism, comprising: (a) input means responsive to operator commands enabling an operator to specify a set of descriptors that are subsequently converted into a bit string, such that each descriptor reflects the presence or absence of a potentially useful feature of interest in a mixture object of interest; (b) storage means for storing the assembled set of (a); (c) memory means for executing programmed steps that examine each mixture object for presence or absence of each of said descriptors; (d) means for assembling the results of (c) into a virtual matrix with each row corresponding to a mixture object and each column corresponding to a feature; (e) means for assigning each mixture object in said matrix recursively into one of two defined categories on the basis of presence or absence of a given feature of interest from said set of descriptors and repeating such analysis until each mixture object of said matrix population has been classified in terms of presence or absence of given features of interest from said set of descriptors and assigned to a terminal node; and (f) output means for visually displaying, using computer graphics, the relationships of said descriptors with said mixture classes and mixture objects.
 6. A computer software system having a set of instructions for controlling a general purpose digital computer in performing a desired function comprising: a set of instructions formed into each of a plurality of modules, each module comprising: (a) an input process responsive to operator commands enabling an operator to specify a set of descriptors and convert said descriptors into a bit string such that each descriptor reflects the presence or absence of a potentially useful feature of interest in a mixture object of interest, wherein each mixture object is a member of a data set where each mixture object presents a same trait of interest through a common underlying mechanism; (b) a data storage process for storing the assembled set of (a); (c) a computational process for executing programmed steps that examine each member object of said data set for presence or absence of each of said descriptors; (d) a computational process for assembling the results of (c) into a vector for each mixture object and a virtual matrix with each row corresponding to a mixture object and each column corresponding to a feature; (e) a computational process for analyzing the data in said matrix into one of two defined categories on the basis of presence or absence of a given feature of interest from said set of descriptors and repeating such analysis until each member of said mixture has been identified in terms of presence or absence of each feature of interest from said set of descriptors and assigned to a terminal node; (f) a data storage process; and (g) an output process for visually displaying, using computer graphics, a relationship of said descriptors with said mixture objects and classes.
 7. A computer-based method of encoding mixture features of planned mixtures or of inadvertent mixtures, or of a combination of planned or inadvertent mixtures, and of identifying and correlating individual said features to a response characteristic that is a trait of interest of the mixture object, wherein said mixture object is in a data set that is characterized in being a mixture of mixture object classes, each class containing one or more of said mixture objects, and wherein multiple mixture objects present a same trait of interest, but classes of mixture objects produce the response characteristic which is a trait of interest through different underlying mechanisms, comprising the steps of: (a) assembling a set of descriptors and converting said set of descriptors into the form of a bit string such that each descriptor reflects the presence or absence of a potentially useful feature of interest in a mixture object of interest; (b) examining each mixture object for presence or absence of each of said descriptors; (c) assembling the results of step (b) into a vector for each mixture object, noting the presence or absence of each feature in said data object; (d) assembling all vectors generated in step (c) into a matrix with each row corresponding to a mixture object and each column corresponding to a feature; (e) dividing the mixture objects in said matrix into two defined daughter nodes on the basis of presence or absence of a given feature of interest from said set of descriptors; and (f) repeating step (e) until each mixture object of said matrix has been identified in terms of presence or absence of given features of interest from said set of descriptors and assigned to a terminal node.
 8. A computer-based apparatus system for allowing a user thereof to encode features of planned mixtures or of inadvertent mixtures, or of a combination of planned or inadvertent mixtures, and to identify and correlate individual said features to a response characteristic that is a trait of interest of the mixture object, applicable to mixture objects in a data set that is characterized in being a mixture of mixture object classes, each class containing one or more of said mixture objects, and wherein multiple mixture objects present a same trait of interest, but classes of mixture objects produce the response characteristic that is a trait of interest through different underlying mechanisms, comprising: (a) input means responsive to operator commands enabling an operator to specify a set of descriptors that are subsequently converted into a bit string, such that each descriptor reflects the presence or absence of a potentially useful feature of interest in a mixture object of interest; (b) storage means for storing the assembled set of (a); (c) memory means for executing programmed steps that examine each mixture object for presence or absence of each of said descriptors; (d) means for assembling the results of (c) into a virtual matrix with each row corresponding to a mixture object and each column corresponding to a feature; (e) means for assigning each mixture object in said matrix recursively into one of two defined categories on the basis of presence or absence of a given feature of interest from said set of descriptors and repeating such analysis until each mixture object of said matrix has been classified in terms of presence or absence of given features of interest from said set of descriptors and assigned to a terminal node; and (f) output means for visually displaying, using computer graphics, the relationships of said descriptors with said mixture objects and classes.
 9. A computer software system having a set of instructions for controlling a general purpose digital computer in performing a desired function comprising: a set of instructions formed into each of a plurality of modules, each module comprising: (a) an input process responsive to operator commands enabling an operator to specify a set of descriptors and convert said descriptors into a bit string such that each descriptor reflects the presence or absence of a potentially useful feature of interest in a mixture object of interest, wherein each mixture object is a member of a data set that is characterized in being a mixture of classes, each class containing one or more of said mixture objects, and wherein multiple mixture objects present the same trait of interest, but classes of mixture objects produce the response characteristic that is a trait of interest through different underlying mechanisms; (b) a data storage process for storing the assembled set of (a); (c) a computational process for executing programmed steps that examine each mixture object of said matrix for presence or absence of each of said descriptors; (d) a computational process for assembling the results of (c) into a vector for each mixture object and a virtual matrix with each row corresponding to a mixture object and each column corresponding to a feature; (e) a computational process for assigning each mixture object in said matrix into one of two defined categories on the basis of presence or absence of a given feature of interest from said set of descriptors and repeating such analysis until each member of said matrix has been classified in terms of presence or absence of given features of interest from said set of descriptors and assigned to a terminal node; (f) a data storage process; and (g) an output process for visually displaying, using computer graphics, a relationship of said descriptors with said mixture objects and classes.
 10. A computer-based method of analyzing biological potency of individual chemical structure features out of a plural mixture of chemical compounds wherein a created data set is characterized in being a mixture of data objects, each data object itself being a mixture of active and/or inactive chemical compounds, which active chemical compounds exhibit a trait of interest, wherein the underlying mechanisms of activity may be through a single or multiple mechanisms, comprising the steps of: (a) assembling a set of descriptors such that each descriptor captures a chemically useful feature of one or more members of a mixture of chemical compounds such that one member is captured if individual chemical compounds are being decoded, two members are captured if pairs of chemical compounds are being decoded, three members are captured if triples of chemical compounds are being decoded and so on; (b) examining each member, pair or triple, or so forth, of said mixture of chemical compounds for presence or absence of each of said features of interest; (c) assembling the results of step (b) into a descriptor vector; (d) comparing the features of the individual compound, pair, triple and so forth, to the features of a terminal node of choice and determining a resident terminal node; (e) repeating step (d) until each compound, pair, triple and so forth of said set of mixtures of chemical compounds has been identified and characterized in relation to the terminal node it would reside within.
 11. The method as claimed in claim 1, 4, 7 or 10, including the additional step of assembling a chemical structure data file.
 12. The method as claimed in claim 1, 4, 7 or 10, including the additional step of assembling biological data pertaining to each chemical mixture or mixture of chemicals and assigning each chemical mixture its biological data.
 13. The method as claimed in claim 1, 4, 7 or 10, in which said correlation is between presence or absence of one or more chemical descriptors and biological activity of a chemical mixture.
 14. The method as claimed in claim 1, 4, 7 or 10, in which said correlation is between presence or absence of one or more chemical descriptors and pharmacological activity of a chemical compound.
 15. The method as claimed in claim 1, 4, 7 or 10, including the additional step of determining structure-activity relationships, such relationships comprising sets of rules defining the sets of features specific to each activity class.
 16. The method as claimed in claim 1, in which said descriptor is an atom pair.
 17. The method as claimed in claim 1, in which said descriptor is an atom triple.
 18. The method as claimed in claim 17, in which said atom triple is a set of three defined atoms in a molecule of interest, each atom defined by element, by spatial relation to each of the other two atoms, and by the type of chemical bond or number of chemical bonds separating them in the molecule.
 19. The method as claimed in claim 1, in which said descriptor is a molecular fragment.
 20. The method as claimed in claim 1, in which said descriptor is a molecular topological torsion.
 21. The method as claimed in claim 1, in which said descriptor is a measure of thermodynamic stability.
 22. The method as claimed in claim 1, in which said descriptor is a binary of continuous variable.
 23. The method as claimed in claim 1, in which said descriptor is a combination in any order of an atom pair, an atom triple, a molecular fragment, a molecular topological torsion, thermodynamic stability or a binary of a continuous variable.
 24. The method as claimed in claim 1, in which each descriptor is an element of a vector in said matrix.
 25. The method as claimed in claim 1, in which presence or absence of each feature of interest is represented as a 1 or a 0, respectively.
 26. The method as claimed in claim 24, in which said vector is computationally represented as a bit string data file.
 27. The method as claimed in claim 26, in which said bit string data file is utilized to computationally create a bit string data file.
 28. The method as claimed in claim 26, in which said bit string is computationally compressed into a sparse matrix.
 29. The method as claimed in claim 28, in which said sparse matrix is statistically analyzed by recursive partitioning.
 30. The method as claimed in claim 29, in which said recursive partitioning is performed by the CART method.
 31. The method as claimed in claim 29, in which said recursive partitioning is performed by the FIRM method.
 32. The method as claimed in claim 29, in which said recursive partitioning is performed by the C4.5 method.
 33. The method as claimed in claim 31, in which said FIRM method is converted from multiway splits to binary splits.
 34. The method as claimed in claim 1, including the additional step of selecting the descriptor that optimally divides said rows of said data matrix into two subsets of rows, being either compounds or mixtures of compounds where said feature of interest is present or absent, respectively, and repeating this process through subsequent iterations until all descriptors in said descriptor set have been examined repeatedly and all said rows assigned to terminal nodes.
 35. The method as claimed in claim 1, in which the result of said recursive partitioning is graphically represented as a recursive partitioning analysis tree.
 36. The method as claimed in claim 1, in which said data objects are discrete compounds.
 37. The method as claimed in claim 4, 7 or 10, in which said data objects are mixtures of discrete compounds.
 38. A computer-based methodof encoding, decoding and identifying individual chemical compounds outof a chemical mixture, comprising the steps of: (a) assembling theresults of previously conducted screening of the chemical mixture for abiological activity of interest; (b) assembling a set of descriptorssuch that each descriptor captures a chemically useful feature of one ormore members of a chemical mixture; (c) examining each combination ofmembers of said chemical mixture for presence or absence of each of saiddescriptors; (d) correlating presence or absence of said chemicaldescriptors with an assigned terminal node, thereby identifyingpredicted activity; and (e) analyzing subsequent chemical mixtures forchemical structure, comparing their chemical structure against saidpredicted activity and extrapolating biological reactivity of suchsubsequent chemical mixtures therefrom.
 39. The method as claimed inclaim 38, including the additional step of assembling a chemicalstructure data file.
 40. The method as claimed in claim 38, includingthe additional step of assembling biological data pertaining to eachchemical compound and assigning each chemical compound or mixture itsbiological data.
 41. The method as claimed in claim 38, in which saidcorrelation is between presence or absence of one or more chemicaldescriptors and biological activity of a chemical compound or mixture.42. The method as claimed in claim 38, in which said correlation isbetween presence or absence of one or more chemical descriptors andpharmacological activity of a chemical compound or mixture.
 43. Themethod as claimed in claim 38, in which said descriptor is an atom pair.44. The method as claimed in claim 38, in which said descriptor is anatom triple.
 45. The method as claimed in claim 44, in which said atomtriple is a set of three defined atoms in a molecule of interest, eachatom defined by element, by spatial relation to each of the other twoatoms, and by the type of chemical bond or number of chemical bondsseparating them in the molecule.
 46. The method as claimed in claim 38, in which said descriptor is a molecular fragment.
 47. The method as claimed in claim 38, in which said descriptor is a molecular topological torsion.
 48. The method as claimed in claim 38, in which said descriptor is a binary of continuous variables.
 49. The method as claimed in claim 38, in which said descriptors are a combination in any order of atom pairs, atom triples, molecular fragments, molecular topological torsions, thermodynamic stability descriptors or a binary of continuous variables.
 50. The method as claimed in claim 38, in which presence or absence of each feature of interest is represented as a 1 or a 0, respectively.
 51. The method as claimed in claim 38, in which said vector is computationally represented as a bit string.
 52. The method as claimed in claim 38, including the additional step of decoding the chemical compounds in said chemical mixture by reference to said matrix vectors for the mixture.
 53. The method as claimed in claim 38, in which said recursive partitioning is graphically represented as a recursive partitioning analysis tree.
 54. A computer-based method of encoding, identifying and correlating individual genetic features of a genetic polymorphism out of a plural populational mixture of individual subjects so as to identify useful diagnoses and therapies of individuals and in the identification of genes and gene products useful in defining biological targets of interest, comprising the steps of: (a) assembling a set of descriptors such that each descriptor captures a genetically useful feature, allele, alleles, or marker, of one or more members of a mixture population of individuals having a phenotype of interest; (b) examining each member of said population of individuals for presence or absence of each of said genetic features; (c) assembling the results of step (b) into a matrix; (d) dividing the data in said matrix into one of two defined categories on the basis of presence or absence of a given genetic feature from said set of genetic features; (e) repeating step (d) until each member of said population of individuals has been identified and characterized in terms of presence or absence of each genetic feature; and (f) correlating presence or absence of said genetic features with known phenotypes of each of said mixture population of individuals, thereby deriving a relationship between genotype and phenotype, said relationship useful in diagnosis and therapy of individuals and also useful for identification of gene products, said gene products useful for selecting drug targets or said gene products useful for determining the genetic origin of a disease.
 55. The method as claimed in claim 54, including the additional step of assembling a populational phenotype data file.
 56. The method as claimed in claim 54, in which said descriptor is an identified allele or marker.
 57. The method as claimed in claim 54, in which said descriptor is absence of a given allele or marker.
 58. The method as claimed in claim 54, in which each descriptor is an element of a vector in said matrix.
 59. The method as claimed in claim 54, in which each individual in said population is encoded by a vector in said matrix.
 60. The method as claimed in claim 54, in which presence or absence of each descriptor is represented as a 1 or a 0, respectively.
 61. The method as claimed in claim 54, in which said matrix vector is computationally represented as a bit string.
 62. The method as claimed in claim 54, in which said bit string is utilized to computationally create a bit string data file.
 63. The method as claimed in claim 54, in which said bit string is computationally compressed as a sparse matrix.
 64. The method as claimed in claim 54, in which said sparse matrix is statistically analyzed by recursive partitioning.
 65. The method as claimed in claim 54, in which said recursive partitioning is performed by the CART method.
 66. The method as claimed in claim 54, in which said recursive partitioning is performed by the FIRM method.
 67. The method as claimed in claim 54, in which said recursive partitioning is performed by the C4.5 method.
 68. The method as claimed in claim 54, in which said FIRM method is converted from multiway splits to binary splits.
 69. The method as claimed in claim 54, including the additional step of selecting the descriptor that correlates most closely with the highest average incidence of a phenotype of interest of all individuals in the population that have such a descriptor and creating two subsets of individuals where said descriptor is present or absent, respectively, and repeating this process through subsequent iterations until all descriptors in said descriptor set have been examined and analyzed for prevalence in said population.
 70. The method as claimed in claim 54, including the additional step of decoding the individuals in said population by reference to said matrix vectors.
 71. The method as claimed in claim 54, in which said recursive partitioning is graphically represented as a recursive partitioning analysis tree.
 72. The method as claimed in claim 54, in which said statistical test for splitting a node is a t-test.
 73. The method as claimed in claim 54, in which said statistical test for splitting a node is a chi-square test.
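
By way of illustration only, the encoding and partitioning steps recited in the claims above (presence or absence coded as 1 or 0, a matrix with one vector per member, and repeated binary division on a descriptor chosen by a t-test) might be sketched as follows. This is a simplified sketch and not the CART, FIRM or C4.5 method named in the claims. The function names, the Welch form of the t-statistic, the threshold `min_t`, and the toy data are all assumptions introduced for this example.

```python
from math import sqrt
from statistics import mean, variance


def encode(feature_sets, descriptors):
    """One 0/1 row per member: 1 if the descriptor is present, else 0."""
    return [[1 if d in feats else 0 for d in descriptors] for feats in feature_sets]


def t_statistic(a, b):
    """Welch t-statistic for two samples; 0.0 if either has fewer than 2 values."""
    if len(a) < 2 or len(b) < 2:
        return 0.0
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0.0:
        return float("inf") if mean(a) != mean(b) else 0.0
    return (mean(a) - mean(b)) / se


def partition(rows, activity, names, members, min_t=2.0):
    """Recursively split `members` (row indices) on the descriptor with the largest |t|."""
    best_col, best_t = None, min_t
    for col in range(len(names)):
        present = [activity[i] for i in members if rows[i][col] == 1]
        absent = [activity[i] for i in members if rows[i][col] == 0]
        t = abs(t_statistic(present, absent))
        if t > best_t:
            best_col, best_t = col, t
    if best_col is None:
        # Terminal node: no descriptor separates the members strongly enough.
        return {"members": members, "mean": mean([activity[i] for i in members])}
    with_d = [i for i in members if rows[i][best_col] == 1]
    without_d = [i for i in members if rows[i][best_col] == 0]
    return {
        "descriptor": names[best_col],
        "present": partition(rows, activity, names, with_d, min_t),
        "absent": partition(rows, activity, names, without_d, min_t),
    }


# Hypothetical toy data: eight members, three descriptors, one measured value each.
descriptors = ["A", "B", "C"]
feature_sets = [{"A", "B"}, {"A"}, {"A", "C"}, {"A", "B", "C"},
                {"B"}, {"C"}, {"B", "C"}, set()]
activity = [9.0, 8.5, 9.5, 9.0, 1.0, 1.5, 0.5, 1.0]

rows = encode(feature_sets, descriptors)
tree = partition(rows, activity, descriptors, list(range(len(rows))))
print(tree)
```

In this toy data only descriptor "A" separates high from low values, so the sketch splits once and then stops at two terminal nodes. A chi-square test on a categorical phenotype could be substituted for the t-statistic without changing the recursive structure.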